Add pipeline checkpoint retention strategies #683

Merged: FurtherAI merged 4 commits into main from austin/checkpoint_retention_strategy on May 20, 2026

@FurtherAI
Collaborator

Summary

  • Adds configurable checkpoint retention to PipelineTrainer
  • Introduces keep_recent_and_top(...), where strategies return eligible checkpoint steps to keep
  • Protects current, leased, and scheduled-eval checkpoints from deletion
  • Records checkpoint retention metadata in history.jsonl without changing normal metric routing
  • Aligns Unsloth loaded-LoRA adapter pruning with Megatron so both unload non-retained vLLM adapters
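
The retention behavior summarized above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the real signature of keep_recent_and_top(...) is not shown here, so the function names `select_steps_to_keep` and `steps_to_delete`, their parameters (`recent`, `top`, `protected`), and the assumption that a higher metric is better are all hypothetical.

```python
from typing import Iterable, List, Mapping, Set


def select_steps_to_keep(
    metrics: Mapping[int, float],
    recent: int,
    top: int,
) -> Set[int]:
    """Hypothetical strategy: keep the `recent` newest steps plus the
    `top` steps with the highest metric (assumes higher is better)."""
    steps = sorted(metrics)
    # Guard against recent == 0, since steps[-0:] would return every step.
    newest = set(steps[-recent:]) if recent > 0 else set()
    ranked = sorted(steps, key=lambda s: metrics[s], reverse=True)
    best = set(ranked[:top]) if top > 0 else set()
    return newest | best


def steps_to_delete(
    all_steps: Iterable[int],
    keep: Iterable[int],
    protected: Iterable[int],
) -> List[int]:
    """Hypothetical pruning rule: a step is deleted only if the strategy
    did not keep it and it is not protected (for example current, leased,
    or scheduled for evaluation)."""
    return sorted(set(all_steps) - set(keep) - set(protected))


# Usage with made-up metric values.
metrics = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.1, 5: 0.3}
keep = select_steps_to_keep(metrics, recent=2, top=1)
delete = steps_to_delete(metrics, keep, protected={3})
```

In this sketch the strategy only decides what to keep; protection is applied separately when computing deletions, so a protected step survives even when the strategy does not select it.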

Validation

  • ruff check
  • ty check
  • git diff --check
  • pytest tests/unit/test_metric_routing.py tests/unit/test_checkpoint_retention.py tests/unit/test_pipeline_trainer_local_backend.py tests/unit/test_multi_checkpoint_inference.py
  • scratch smoke test for actual local checkpoint dir deletion and lease protection

@FurtherAI FurtherAI requested a review from arcticfly May 19, 2026 18:09
@FurtherAI FurtherAI merged commit e9fdf04 into main May 20, 2026
5 checks passed
@FurtherAI FurtherAI deleted the austin/checkpoint_retention_strategy branch May 20, 2026 18:05

2 participants